Don't Use a Lot When Little Will Do: Genre Identification Using URLs

Authors: Pattisapu Nikhil Priyatam, Srinivasan Iyengar, Krish Perumal, Vasudeva Varma

Research in Computing Science, Vol. 70, pp. 207-218, 2013.

Abstract: The ever-increasing volume of data on the World Wide Web calls for the use of vertical search engines. Sandhan is one such search engine, offering search in the tourism and health genres in more than 10 Indian languages. In this work we build a URL-based genre identification module for Sandhan. A direct application of this work is in building focused crawlers to gather Indian-language content. We conduct experiments on tourism and health web pages in Hindi. We experiment with three approaches: list-based, naive Bayes, and incremental naive Bayes. We evaluate our approaches against another web page classification algorithm built on the parsed text of manually labeled web pages. We find that the incremental naive Bayes approach outperforms the other two. In our experiments we work with different features such as words, n-grams, and all-grams. Using n-gram features we achieve classification accuracies of 0.858 and 0.873 for the tourism and health genres respectively.
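As a rough illustration of the idea in the abstract, the following is a minimal sketch of a naive Bayes classifier over character n-grams extracted from URLs. It is not the authors' implementation: the tokenization, the trigram size, the Laplace smoothing, the `update` method (one simple way to add training URLs one at a time, which may differ from the paper's incremental variant), and all example URLs and labels are assumptions made for illustration.

```python
import math
import re
from collections import Counter, defaultdict


def url_ngrams(url, n=3):
    """Split a URL into lowercase alphabetic tokens and return their character n-grams."""
    tokens = re.findall(r"[a-z]+", url.lower())
    grams = []
    for tok in tokens:
        if len(tok) < n:
            grams.append(tok)  # keep short tokens whole
        else:
            grams.extend(tok[i:i + n] for i in range(len(tok) - n + 1))
    return grams


class UrlNaiveBayes:
    """Multinomial naive Bayes over URL n-grams with Laplace smoothing (illustrative only)."""

    def __init__(self, n=3):
        self.n = n
        self.class_counts = Counter()               # number of training URLs per label
        self.feature_counts = defaultdict(Counter)  # n-gram counts per label
        self.vocab = set()

    def update(self, url, label):
        """Add one labeled URL to the counts; can be called at any time."""
        grams = url_ngrams(url, self.n)
        self.class_counts[label] += 1
        self.feature_counts[label].update(grams)
        self.vocab.update(grams)

    def predict(self, url):
        """Return the label with the highest log-posterior for the URL."""
        grams = url_ngrams(url, self.n)
        total_docs = sum(self.class_counts.values())
        vocab_size = len(self.vocab)
        best_label, best_score = None, -math.inf
        for label, count in self.class_counts.items():
            score = math.log(count / total_docs)
            total = sum(self.feature_counts[label].values())
            for g in grams:
                score += math.log((self.feature_counts[label][g] + 1) / (total + vocab_size))
            if score > best_score:
                best_label, best_score = label, score
        return best_label


# Hypothetical training URLs, invented for this sketch.
model = UrlNaiveBayes(n=3)
model.update("http://example.com/travel/hotels-in-goa", "tourism")
model.update("http://example.org/tourism/places-to-visit", "tourism")
model.update("http://example.com/news/cricket-score", "other")
model.update("http://example.org/finance/stock-market", "other")

print(model.predict("http://example.net/goa-travel-guide"))
print(model.predict("http://example.net/stock-market-news"))
```

Because the classifier looks only at the URL string, a crawler could in principle score a link before fetching the page, which is the appeal of URL-based genre identification for focused crawling.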

Keywords: Genre Identification, Focused Crawlers, Web Page Classification

PDF: Don't Use a Lot When Little Will Do: Genre Identification Using URLs